---
title: 'Exploratory Data Analysis: OpenBeta Climbing Data (2022)'
author: "Nina"
date: '2022-05-30'
output: pdf_document
tags:
  - R Markdown
  - plot
  - regression
categories:
  - R
  - Visualizations
---

The population of climbers is exploding, and we need more and better access to data to make the sport more accessible. I am using data provided by OpenBeta, a nonprofit built and run by climbers that enables “open access and innovative uses of climbing data” (1). I built a Shiny app that includes a map of sport and trad climbing routes in Oregon ranked by route quality, along with a recommendation engine for users of any skill level.
Following climbing’s Olympic debut at the Tokyo Games in 2021 and the success of films like Free Solo, featuring Alex Honnold (2018), and The Dawn Wall, featuring Tommy Caldwell (2018), the industry is seeing historic growth and opportunities for new, profitable markets. According to Forbes, Google searches that included the term “climbing” reached an all-time high in the first week of August 2021, the same week that the men’s and women’s combined events were held (2). Not only is the sport gaining a bigger audience, it is also attracting regular people like you and me to take a crack at the crag. Following the pandemic, nearly 100 climbing gyms have opened in North America, and El Cap, one of America’s largest operators of climbing facilities, saw a 100% increase in online interactions (2).

One concern with the sport’s booming popularity is the barrier to entry, and as a result there has been a push to make the sport more accessible. In addition, only a few databases of outdoor climbing routes are accessible to the public, such as Mountain Project or 8A. Without these platforms, climbers looking to hit their local crag or boulder may not be able to find routes, or judge their quality, unless they already have a community or word of mouth to rely on. While these websites provide helpful tools to climbers of all experience and skill levels, they still lack a great deal of data, and attempts to scrape them have resulted in DMCA takedowns or lawsuits (3). At the bare minimum, we need better and easier access to climbing data so that data scientists like myself can work to advance the sport for others. As the sport grows, so will the influx of data, and as with any rapidly expanding field, data science can add value by supporting better-informed decisions for multiple stakeholders, generating new insights about its players and audience, and improving the overall experience for users.
OpenBeta is a nonprofit built and run by climbers that enables “open access and innovative uses of climbing data” (1). According to Outside Learn, it has faced several challenges of its own, including copyright infringement claims and blocked repositories, over its attempt to use onX’s data from Mountain Project (3). At the moment the data is public, and GitHub recently reversed the DMCA takedown thanks to legal efforts from the owner, Viet Nguyen, who is “empowering the community with open license climbing betas and source tools” (1). His goal for OpenBeta is to make climbing data more like an open source project, which in turn would help platforms like Mountain Project improve their recommendation systems, geolocation data, and the accuracy of submissions (3). In addition to pushing for accessible data, OpenBeta posts articles that fit the needs of any climber in STEM: tutorials, current events, and project inspirations like recommendation systems and route quality maps. The community that OpenBeta is fostering aligns closely with climbing’s current forward-looking mentality: don’t be a gatekeeper, spread the beta, and anyone is capable.
As a young climber and data scientist, I found myself incredibly inspired by OpenBeta’s work and wanted to support the nonprofit by using their data and some of their resources for my capstone project. I want to leverage climbing data to inform decision making for climbers of all skill levels and, as a result, contribute to OpenBeta’s overarching goal of making the sport of climbing safer, better informed, and more accessible. Recommendation systems are extremely powerful and, if done well, can be a great tool for young climbers exploring outdoor routes. To get a better understanding of the data and to check the viability of this goal, I am performing an exploratory data analysis of the OpenBeta data.

Setup

library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✓ ggplot2 3.3.5     ✓ purrr   0.3.4
## ✓ tibble  3.1.6     ✓ dplyr   1.0.8
## ✓ tidyr   1.2.0     ✓ stringr 1.4.0
## ✓ readr   2.0.1     ✓ forcats 0.5.1
## Warning: package 'tidyr' was built under R version 4.1.2
## Warning: package 'dplyr' was built under R version 4.1.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
or_ratings <- read_csv("or-ratings.csv")
## Rows: 58256 Columns: 6
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (4): users, name, grade, type
## dbl (2): ratings, route_id
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
or_ratings
## # A tibble: 58,256 × 6
##    users                                    ratings  route_id name   grade type 
##    <chr>                                      <dbl>     <dbl> <chr>  <chr> <chr>
##  1 d093d0f400441c3a85a0451b2d73542ddc294142       1 118178146 Twin … 5.9   {'tr…
##  2 f56f930bec66e51f6ddfe6fa3fe57cc78fc2f732       3 119256183 Winds… 5.12a {'sp…
##  3 04fd5a96a0f8f32c25fb6935b03f730f97208cbf       3 119256183 Winds… 5.12a {'sp…
##  4 6f83adad055036f3e9213e97925bfb6062058a96       3 119256183 Winds… 5.12a {'sp…
##  5 54b89a698063861992f6c4c270728aafa8c9092f       2 119256183 Winds… 5.12a {'sp…
##  6 f766420d4f0aced01935c36c1fa431f628dad1ed       4 106266547 South… 5.2   {'tr…
##  7 5444a5e886e88e2f47fe95ec9263c76aad75f165       4 106266547 South… 5.2   {'tr…
##  8 4b7315ea87e61ec7d1297727ea1682d1f24c5495       4 106266547 South… 5.2   {'tr…
##  9 14bf58a87fae995acbc68b41e593216ac9b0051e       4 106266547 South… 5.2   {'tr…
## 10 5d469f91840fc8b7f0bd365c1ed4a84ae7601ef3       4 106266547 South… 5.2   {'tr…
## # … with 58,246 more rows
or_ratings <- or_ratings %>%
  mutate(
    # Flag each route type; a missing `type` counts as neither
    trad  = coalesce(str_detect(type, "tr"), FALSE),
    sport = coalesce(str_detect(type, "sp"), FALSE)
  ) %>%
  # Keep routes that are exactly one of trad or sport
  filter(trad != sport) %>%
  mutate(type = if_else(trad, "trad", "sport")) %>%
  select(-trad, -sport)
or_quality <-
  read_csv("or_quality_data.csv")
## New names:
## * `` -> ...1
## Rows: 2767 Columns: 20
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (7): route_name, type_string, parent_sector, parent_loc, nopm_YDS, safe...
## dbl (13): ...1, route_ID, sector_ID, num_votes, adjusted_num_votes, mean_rat...
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
or_quality
## # A tibble: 2,767 × 20
##     ...1 route_name    route_ID type_string sector_ID parent_sector parent_loc  
##    <dbl> <chr>            <dbl> <chr>           <dbl> <chr>         <chr>       
##  1 29960 Skelator     107358365 trad        107357989 Castle        [-121.0386,…
##  2 29961 As You Wish  107357995 trad        107357989 Castle        [-121.0386,…
##  3 29962 African Swa… 107358243 trad        107357989 Castle        [-121.0386,…
##  4 29963 Mekaneck     107358348 trad        107357989 Castle        [-121.0386,…
##  5 29964 Holy Hand G… 107358251 trad        107357989 Castle        [-121.0386,…
##  6 29965 Fezzik       107358339 trad        107357989 Castle        [-121.0386,…
##  7 29966 Inconceivab… 107358324 trad        107357989 Castle        [-121.0386,…
##  8 29967 Trojan Rabb… 107358465 trad        107357989 Castle        [-121.0386,…
##  9 29968 Orko         107358440 trad        107357989 Castle        [-121.0386,…
## 10 29969 Big Arch Co… 105842927 trad        105842915 Great Arch, … [-122.14883…
## # … with 2,757 more rows, and 13 more variables: num_votes <dbl>,
## #   adjusted_num_votes <dbl>, mean_rating <dbl>, median_rating <dbl>,
## #   mode_rating <dbl>, RQI_mean <dbl>, RQI_median <dbl>, ARQI_mean <dbl>,
## #   ARQI_median <dbl>, nopm_YDS <chr>, YDS_rank <dbl>, safety <chr>,
## #   state <chr>
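orGeo <- or_quality %>%
  # parent_loc is a "[lon, lat]" string: longitude comes first, latitude follows the space.
  # The decimal point is escaped so that "." only matches a literal dot.
  mutate(lon = as.numeric(str_extract(parent_loc, '-?\\d+\\.?\\d*')),
         lat = as.numeric(str_extract(parent_loc, '\\s-?\\d+\\.?\\d*'))) %>%
  filter(!is.na(lat), !is.na(lon)) %>%
  mutate(route_ID = as.character(route_ID))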
orGeo
## # A tibble: 2,767 × 22
##     ...1 route_name   route_ID  type_string sector_ID parent_sector parent_loc  
##    <dbl> <chr>        <chr>     <chr>           <dbl> <chr>         <chr>       
##  1 29960 Skelator     107358365 trad        107357989 Castle        [-121.0386,…
##  2 29961 As You Wish  107357995 trad        107357989 Castle        [-121.0386,…
##  3 29962 African Swa… 107358243 trad        107357989 Castle        [-121.0386,…
##  4 29963 Mekaneck     107358348 trad        107357989 Castle        [-121.0386,…
##  5 29964 Holy Hand G… 107358251 trad        107357989 Castle        [-121.0386,…
##  6 29965 Fezzik       107358339 trad        107357989 Castle        [-121.0386,…
##  7 29966 Inconceivab… 107358324 trad        107357989 Castle        [-121.0386,…
##  8 29967 Trojan Rabb… 107358465 trad        107357989 Castle        [-121.0386,…
##  9 29968 Orko         107358440 trad        107357989 Castle        [-121.0386,…
## 10 29969 Big Arch Co… 105842927 trad        105842915 Great Arch, … [-122.14883…
## # … with 2,757 more rows, and 15 more variables: num_votes <dbl>,
## #   adjusted_num_votes <dbl>, mean_rating <dbl>, median_rating <dbl>,
## #   mode_rating <dbl>, RQI_mean <dbl>, RQI_median <dbl>, ARQI_mean <dbl>,
## #   ARQI_median <dbl>, nopm_YDS <chr>, YDS_rank <dbl>, safety <chr>,
## #   state <chr>, lon <dbl>, lat <dbl>
## # A tibble: 56,024 × 27
##    users       ratings route_id name   grade type   ...1 route_name  type_string
##    <chr>         <dbl> <chr>    <chr>  <chr> <chr> <dbl> <chr>       <chr>      
##  1 d093d0f400…       1 1181781… Twin … 5.9   trad  30010 Twin Sister trad       
##  2 f56f930bec…       3 1192561… Winds… 5.12a sport 70489 Winds of P… sport      
##  3 04fd5a96a0…       3 1192561… Winds… 5.12a sport 70489 Winds of P… sport      
##  4 6f83adad05…       3 1192561… Winds… 5.12a sport 70489 Winds of P… sport      
##  5 54b89a6980…       2 1192561… Winds… 5.12a sport 70489 Winds of P… sport      
##  6 f766420d4f…       4 1062665… South… 5.2   trad  96351 South Ridg… trad       
##  7 5444a5e886…       4 1062665… South… 5.2   trad  96351 South Ridg… trad       
##  8 4b7315ea87…       4 1062665… South… 5.2   trad  96351 South Ridg… trad       
##  9 14bf58a87f…       4 1062665… South… 5.2   trad  96351 South Ridg… trad       
## 10 5d469f9184…       4 1062665… South… 5.2   trad  96351 South Ridg… trad       
## # … with 56,014 more rows, and 18 more variables: sector_ID <dbl>,
## #   parent_sector <chr>, parent_loc <chr>, num_votes <dbl>,
## #   adjusted_num_votes <dbl>, mean_rating <dbl>, median_rating <dbl>,
## #   mode_rating <dbl>, RQI_mean <dbl>, RQI_median <dbl>, ARQI_mean <dbl>,
## #   ARQI_median <dbl>, nopm_YDS <chr>, YDS_rank <dbl>, safety <chr>,
## #   state <chr>, lon <dbl>, lat <dbl>
##  [1] "5.0"     "5.1"     "5.2"     "5.3"     "5.4"     "5.5"     "5.6"    
##  [8] "5.7"     "5.7+"    "5.8"     "5.8-"    "5.8+"    "5.9"     "5.9-"   
## [15] "5.9+"    "5.10"    "5.10-"   "5.10+"   "5.10a"   "5.10a/b" "5.10b"  
## [22] "5.10b/c" "5.10c"   "5.10c/d" "5.10d"   "5.11"    "5.11-"   "5.11+"  
## [29] "5.11a"   "5.11a/b" "5.11b"   "5.11b/c" "5.11c"   "5.11c/d" "5.11d"  
## [36] "5.12"    "5.12-"   "5.12+"   "5.12a"   "5.12a/b" "5.12b"   "5.12b/c"
## [43] "5.12c"   "5.12c/d" "5.12d"   "5.13"    "5.13-"   "5.13+"   "5.13a"  
## [50] "5.13a/b" "5.13b"   "5.13b/c" "5.13c"   "5.13c/d" "5.13d"   "5.14"   
## [57] "5.14-"   "5.14a"   "5.14b"   "5.14c"

In my process, I took all of the Oregon ratings from OpenBeta. I first wanted to work with the West Coast for a couple of reasons. For one, the Sierra Nevada of California and the Cascade Range of the Pacific Northwest are prime western U.S. rock climbing locales. In addition, the West Coast is scattered with popular climbing spots (e.g., Yosemite, Joshua Tree, Smith Rock), but there is a common misconception that these areas only have expert-graded routes. In reality the opposite is true: there are more beginner to moderate routes than expert ones.

We only use routes graded in the Yosemite Decimal System (YDS), the traditional difficulty rating for routes in the US. In the YDS, 5.0 to 5.7 is considered easy, 5.8 to 5.10 intermediate, 5.11 to 5.12 hard, and 5.13 to 5.15 is reserved for a very elite few. This means the app will be of use to any climber, since some local classics come even in the lower grades.
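
The `level` column that shows up in later output follows these bands, but the chunk that builds it is not shown. Here is a minimal sketch of one way it could be derived. The helper `yds_level()` and the labels other than "intermediate" are my own assumptions, and I am assuming that a suffix such as "a/b" or "+" does not change the band.

```r
# Hypothetical helper: map a YDS grade string to the difficulty bands above
yds_level <- function(grade) {
  # Pull out the number after "5." (e.g. "5.10a/b" -> 10, "5.8+" -> 8)
  minor <- as.integer(sub("^5\\.([0-9]+).*$", "\\1", grade))
  # Bands: 5.0-5.7 easy, 5.8-5.10 intermediate, 5.11-5.12 hard, 5.13+ elite
  cut(minor,
      breaks = c(-Inf, 7, 10, 12, Inf),
      labels = c("easy", "intermediate", "hard", "elite"))
}

yds_level(c("5.2", "5.8+", "5.10a/b", "5.12a", "5.14c"))
```

Grades that are not in YDS form would come back as `NA` (with a coercion warning), which is one way to spot rows that need to be filtered out first.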

Furthermore, I decided to home in on Oregon routes because California routes already get national coverage and booming popularity. Few people know that Oregon’s Smith Rock State Park is the birthplace of sport climbing, brought there by Alan Watts, a famous climbing pioneer, in 1986 (7). Not only is some of the best climbing in the nation found at Smith Rock, there are also hidden gems right in Portland’s backyard that took me significant research as a climber to find. It would be even harder for a new climber to discover some of these on their own, even though most of these areas have routes for climbers of all skill levels. The idea of my project is that anyone can have access to local classics, even in the lower grades.

One dataset I used from OpenBeta contains all route ratings in Oregon along with the route ID, grade, name, and type (trad, sport, ice, bouldering, etc.). I’m going to look at only trad and sport routes, which are the two most popular types of outdoor climbing. A limitation of this method is that we lose routes that are tagged as both sport and trad. Another dataset I used has aggregate rating data from OpenBeta along with the location of parent walls to use for plotting. The features I used from this dataset are the parent wall ID, name, and location, along with the state and ARQI rating (this metric is explained later).

We can see that our data is dominated by sport routes (after all, Oregon is the birthplace of sport climbing). This raised a bias concern for an item-based collaborative filtering recommendation system. That said, sport climbing is easily the most popular form of climbing nowadays. Not only is trad climbing out of date and mostly done by the pros, it is also extremely expensive and, as a result, a barrier to entry to the sport itself. Therefore I felt okay about this data imbalance when making a recommendation to a new user: they probably don’t want to be recommended trad routes at lower levels.

## Warning: Ignoring unknown parameters: binwidth, bins, pad

As part of my exploratory data analysis, I also wanted to get a breakdown of classic routes. As a metric for route quality, we can look at the aggregate metric RQI or ARQI. The RQI is equal to S(1 - 1/N), where S is the average (or median) star rating and N is the number of votes. As N approaches infinity, (1 - 1/N) approaches 1 and the RQI approaches S. One issue with this metric is that harder routes get fewer ascents and therefore fewer votes, making it difficult for hard routes to make it into the “classic” class. We will use the Adjusted RQI (ARQI), which corrects the RQI’s bias towards easier routes by adjusting the number of votes, so that route quality does not become a “popularity metric.” The ARQI is equal to S(1 - 1/Nw), where Nw is the number of weighted (adjusted) votes and is determined by the votes-per-route for each grade.

According to OpenBeta, the categories for route quality are the following:

- Classic: ARQI >= 3.5
- Area Classic: 2.5 <= ARQI < 3.5
- Good: 1.5 <= ARQI < 2.5
- Bad: 0.5 <= ARQI < 1.5
- Bomb: ARQI < 0.5
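
To make the formula concrete, here is a small worked sketch of the RQI for a route with a 4-star rating and a growing number of votes. The function name `rqi()` is my own; the dataset already ships with precomputed `RQI_*` and `ARQI_*` columns, so this is only an illustration of the arithmetic, not how OpenBeta computes them.

```r
# RQI = S * (1 - 1/N): S is the mean (or median) star rating, N the number of votes
rqi <- function(stars, n_votes) stars * (1 - 1 / n_votes)

rqi(4, 1)    # a single vote gives 0, however good the rating
rqi(4, 2)    # 2
rqi(4, 10)   # 3.6, just past the "classic" threshold of 3.5
rqi(4, 100)  # 3.96, approaching S = 4
```

The ARQI has the same shape with the adjusted vote count in place of N. How the adjusted count is derived per grade is not reproduced here.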

or_ratings <- or_ratings %>% 
  distinct() %>%
  mutate(class = case_when(
    ARQI_median >= 3.5 ~ "classic",
    ARQI_median >= 2.5 & ARQI_median < 3.5 ~ "area classic",
    ARQI_median >= 1.5 & ARQI_median < 2.5 ~ "good",
    ARQI_median >= 0.5 & ARQI_median < 1.5 ~ "bad",
    ARQI_median < 0.5 ~ "bomb")) %>%
  group_by(parent_sector) %>%
  mutate(best_route = route_name[which.max(ARQI_median)]) %>%
  ungroup()
    
or_ratings
## # A tibble: 55,994 × 30
##    users       ratings route_id name   grade type   ...1 route_name  type_string
##    <chr>         <dbl> <chr>    <chr>  <fct> <chr> <dbl> <chr>       <chr>      
##  1 d093d0f400…       1 1181781… Twin … 5.0   trad  30010 Twin Sister trad       
##  2 f56f930bec…       3 1192561… Winds… 5.1   sport 70489 Winds of P… sport      
##  3 04fd5a96a0…       3 1192561… Winds… 5.1   sport 70489 Winds of P… sport      
##  4 6f83adad05…       3 1192561… Winds… 5.1   sport 70489 Winds of P… sport      
##  5 54b89a6980…       2 1192561… Winds… 5.1   sport 70489 Winds of P… sport      
##  6 f766420d4f…       4 1062665… South… 5.2   trad  96351 South Ridg… trad       
##  7 5444a5e886…       4 1062665… South… 5.2   trad  96351 South Ridg… trad       
##  8 4b7315ea87…       4 1062665… South… 5.2   trad  96351 South Ridg… trad       
##  9 14bf58a87f…       4 1062665… South… 5.2   trad  96351 South Ridg… trad       
## 10 5d469f9184…       4 1062665… South… 5.2   trad  96351 South Ridg… trad       
## # … with 55,984 more rows, and 21 more variables: sector_ID <dbl>,
## #   parent_sector <chr>, parent_loc <chr>, num_votes <dbl>,
## #   adjusted_num_votes <dbl>, mean_rating <dbl>, median_rating <dbl>,
## #   mode_rating <dbl>, RQI_mean <dbl>, RQI_median <dbl>, ARQI_mean <dbl>,
## #   ARQI_median <dbl>, nopm_YDS <chr>, YDS_rank <dbl>, safety <chr>,
## #   state <chr>, lon <dbl>, lat <dbl>, level <fct>, class <chr>,
## #   best_route <chr>

Note that some routes with a lower ARQI may have a higher median rating. The ARQI takes the number of votes into consideration. This allows for a more accurate and fair route designation (we don’t want just any route falling into a classic).

library(ggthemes) # provides theme_tufte()
or_ratings$class <- factor(or_ratings$class,
                           levels = c("classic", "area classic", "good", "bad", "bomb"))
or_ratings %>% 
  filter(num_votes < 400) %>% #filter outliers
  select(num_votes, median_rating, class) %>%
  distinct() %>%
  ggplot(aes(num_votes, median_rating, group = class)) +
  geom_point(aes(color = class), alpha = 0.6, position = position_jitterdodge(jitter.width = .9, jitter.height = 0.1)) +
  scale_color_brewer(palette = "YlOrRd") +
  theme_tufte() +
  labs(x = "\nNumber of Votes\n", y = "\nMedian Rating\n", title = "Median Ratings vs Number of Votes by ARQI Class\n")

At the state level, we found that a majority of routes fall in the Good to Area Classic range with outliers in the Bad class. But what about by grade?

Class Distribution Across Grade

We find that a majority of easy and intermediate routes are classics or area classics, which supports my claim that anyone can climb a classic, not only at their local crag but also on the famous big walls climbed by the legends.

or_ratings %>%
  # One row per route, so the bars count routes rather than individual ratings
  distinct(route_id, level, class) %>%
  ggplot(aes(x = level)) +
  geom_bar(aes(fill = class), position = "dodge") +
  scale_fill_brewer(palette = "Set1") +
  theme_tufte() + 
  labs(x = "\nLevel\n", y = "Number of Routes\n", title = "Distribution of Route Classes by Grade Level")

Plot routes

      I plan to combine a recommendation and route ranking system in a map format using geolocation and ratings data. Specifically, I will use Plotly for interactivity, Mapbox for geocoding, and Shiny for construction of the web application. Here I am accessing my public token from Mapbox in order to do some basic plotting and interactivity with Plotly.

library(mapboxapi)
## Warning: package 'mapboxapi' was built under R version 4.1.2
## Usage of the Mapbox APIs is governed by the Mapbox Terms of Service.
## Please visit https://www.mapbox.com/legal/tos/ for more information.
my_token <- 'pk.eyJ1Ijoibmhlcm5hbmRlejE5OTkiLCJhIjoiY2wzZGZjZDEwMDFyajNjbDVxMnJ2M2lwdSJ9.N9R9fzcgvK1ieQ_s5eVwQw'

#mb_access_token(my_token, install = TRUE, overwrite = TRUE)
#readRenviron("~/.Renviron")

Sys.setenv('MAPBOX_PUBLIC_TOKEN' = my_token)

Sys.getenv('MAPBOX_PUBLIC_TOKEN')
## [1] "pk.eyJ1Ijoibmhlcm5hbmRlejE5OTkiLCJhIjoiY2wzZGZjZDEwMDFyajNjbDVxMnJ2M2lwdSJ9.N9R9fzcgvK1ieQ_s5eVwQw"

Plot all West Coast walls

With some basic plotly commands, we can plot all the West Coast routes. Ideally the user will be able to filter grade range, type, location, and rating on the Shiny app.

library(plotly)
## 
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## The following object is masked from 'package:stats':
## 
##     filter
## The following object is masked from 'package:graphics':
## 
##     layout
fig <- or_ratings %>%
  plot_ly(lat = ~lat, 
          lon = ~lon, 
          mode = 'markers',
          type = 'scattermapbox',
          color = or_ratings$class,
          hoverinfo = 'text',
          text = paste("Parent wall: ", or_ratings$parent_sector,
                       "<br>",
                       "Best Route: ", or_ratings$best_route,
                       "<br>",
                       "Type: ", or_ratings$type_string, 
                       "<br>", 
                       "Class: ", or_ratings$class,
                       "<br>",
                       "ARQI: ", or_ratings$ARQI_median,
                       "<br>",
                       "Grade: ", or_ratings$grade)
                       ) %>%
  layout(
    mapbox = list(
      style = 'open-street-map', # or 'light'
      zoom = 5,
      center = list(lon = -120, lat = 44)
    )
  ) %>%
  config(mapboxAccessToken = Sys.getenv("MAPBOX_PUBLIC_TOKEN"))

fig

Sport

Now we plot by type using ARQI_median as metric for ‘route quality.’ Here I’m doing some basic data wrangling to find the route with the best ARQI for parent walls with sport routes.

metric <- "ARQI_median"
sport <- 'sport'

df_sport <- orGeo %>%
  filter(type_string == sport) %>%
  group_by(sector_ID) %>%
  # all_of() makes it explicit that `metric` holds a column name
  select(sector_ID, route_name, nopm_YDS, safety, all_of(metric))
name_sport <- orGeo %>%
  select(parent_sector, sector_ID, lon, lat) %>%
  filter(!duplicated(sector_ID))

agg_sport <- inner_join(df_sport, name_sport) 
## Joining, by = "sector_ID"
orGeo_sport <- agg_sport %>%
  group_by(parent_sector) %>%
  mutate(n_routes = length(route_name)) %>%
  mutate(best_route_name = route_name[which.max(ARQI_median)])

orGeo_sport
## # A tibble: 1,741 × 10
## # Groups:   parent_sector [234]
##    sector_ID route_name    nopm_YDS safety ARQI_median parent_sector   lon   lat
##        <dbl> <chr>         <chr>    <chr>        <dbl> <chr>         <dbl> <dbl>
##  1 105842915 Tyler's Route 5.10a    <NA>         0.239 Great Arch, … -122.  44.3
##  2 105842915 Black Kettle  5.10a    <NA>         3.50  Great Arch, … -122.  44.3
##  3 105842915 Phadra        5.10c    <NA>         3.26  Great Arch, … -122.  44.3
##  4 105842915 Forked Route  5.10c    <NA>         2.56  Great Arch, … -122.  44.3
##  5 105842915 Choppin' at … 5.10c    <NA>         1.53  Great Arch, … -122.  44.3
##  6 105842915 Captain Cour… 5.10c    <NA>         2.33  Great Arch, … -122.  44.3
##  7 105842915 Right Corner… 5.9      <NA>         2.74  Great Arch, … -122.  44.3
##  8 105842915 Navgrin       5.9      <NA>         2.42  Great Arch, … -122.  44.3
##  9 105842915 Unchained     5.11a    <NA>         2.53  Great Arch, … -122.  44.3
## 10 105842915 Solstice Par… 5.11a    <NA>         2.75  Great Arch, … -122.  44.3
## # … with 1,731 more rows, and 2 more variables: n_routes <int>,
## #   best_route_name <chr>

Plot

After setting various filters on the app, a map colored and sized by route quality would be provided to help the climber find the best possible routes in their ideal range.

orGeo_sport %>%
  plot_ly(lat = ~lat, 
          lon = ~lon, 
          mode = 'markers',
          type = 'scattermapbox',
          text = orGeo_sport$parent_sector,
          hoverinfo = 'text',
          color = ~ orGeo_sport$ARQI_median,
          size = ~ orGeo_sport$ARQI_median) %>%
  layout(
     mapbox = list(
      style = 'open-street-map', # or 'light'
      zoom = 5,
      center = list(lon = -120, lat = 44)
    )
  ) %>%
  config(mapboxAccessToken = Sys.getenv("MAPBOX_PUBLIC_TOKEN"))
## Warning: `line.width` does not currently support multiple values.
## Warning in min(x, na.rm = na.rm): no non-missing arguments to min; returning Inf
## Warning in max(x, na.rm = na.rm): no non-missing arguments to max; returning
## -Inf
## Warning in min(x, na.rm = na.rm): no non-missing arguments to min; returning Inf
## Warning in max(x, na.rm = na.rm): no non-missing arguments to max; returning
## -Inf

Trad

I did a similar process for trad routes.

trad <- 'trad'

df_trad <- orGeo %>%
  filter(type_string == trad) %>%
  group_by(sector_ID) %>%
  # all_of() makes it explicit that `metric` holds a column name
  select(sector_ID, route_name, nopm_YDS, safety, all_of(metric))

name_trad <- orGeo %>%
  select(parent_sector, sector_ID, lon, lat) %>%
  filter(!duplicated(sector_ID))

agg_trad <- inner_join(df_trad, name_trad) 
## Joining, by = "sector_ID"
orGeo_trad <- agg_trad %>%
  group_by(parent_sector) %>%
  mutate(n_routes = length(route_name)) %>%
  mutate(best_route_name = route_name[which.max(ARQI_median)])

orGeo_trad
## # A tibble: 1,026 × 10
## # Groups:   parent_sector [190]
##    sector_ID route_name    nopm_YDS safety ARQI_median parent_sector   lon   lat
##        <dbl> <chr>         <chr>    <chr>        <dbl> <chr>         <dbl> <dbl>
##  1 107357989 Skelator      5.7      <NA>         1.17  Castle        -121.  44.0
##  2 107357989 As You Wish   5.7      PG           1.46  Castle        -121.  44.0
##  3 107357989 African Swal… 5.7      <NA>         0.874 Castle        -121.  44.0
##  4 107357989 Mekaneck      5.8      <NA>         0.519 Castle        -121.  44.0
##  5 107357989 Holy Hand Gr… 5.8      <NA>         0.260 Castle        -121.  44.0
##  6 107357989 Fezzik        5.6      <NA>         1.15  Castle        -121.  44.0
##  7 107357989 Inconceivable 5.9      <NA>         1.53  Castle        -121.  44.0
##  8 107357989 Trojan Rabbit 5.5      <NA>         0.988 Castle        -121.  44.0
##  9 107357989 Orko          5.5      <NA>         0.988 Castle        -121.  44.0
## 10 105842915 Big Arch Cor… 5.7      <NA>         1.17  Great Arch, … -122.  44.3
## # … with 1,016 more rows, and 2 more variables: n_routes <int>,
## #   best_route_name <chr>

Plot

orGeo_trad %>%
  plot_ly(lat = ~lat, 
          lon = ~lon, 
          mode = 'markers',
          type = 'scattermapbox',
          text = orGeo_trad$parent_sector,
          hoverinfo = 'text', 
          color = ~ orGeo_trad$ARQI_median,
          size = ~ orGeo_trad$ARQI_median) %>%
  layout(
     mapbox = list(
      style = 'open-street-map', # or 'light'
      zoom = 5,
      center = list(lon = -120, lat = 44)
    )
  ) %>%
  config(mapboxAccessToken = Sys.getenv("MAPBOX_PUBLIC_TOKEN"))
## Warning: `line.width` does not currently support multiple values.
## Warning in min(x, na.rm = na.rm): no non-missing arguments to min; returning Inf
## Warning in max(x, na.rm = na.rm): no non-missing arguments to max; returning
## -Inf
## Warning in min(x, na.rm = na.rm): no non-missing arguments to min; returning Inf
## Warning in max(x, na.rm = na.rm): no non-missing arguments to max; returning
## -Inf

Recommendations

For the recommendation system, I will create an item-based collaborative filtering (IBCF) recommender, which asks the question “for users who climbed route x, which routes did they also climb?” and predicts routes based on the past preferences of other users (1). From what I’ve seen of famous route finders like theCrag or Mountain Project (MP), these platforms do not implement recommendation systems for their routes (4, 5). They usually order routes by popularity (average rating or number of votes), but any data analyst using mean, median, or count as a popularity metric should always consider outliers, skewed data, and relative proportions. In addition, I think a simple recommendation system would be ideal for new climbers looking to find their first projects. I believe that a recommendation system combined with a map of route quality by ARQI score benefits the experienced climber too. For example, if they disagree with the location, rating, or quality assessment of a route, and a recommendation fails as a result, they can enter data they believe is more accurate into Mountain Project (where OpenBeta gets its data). As the data funneling into the model becomes more accurate, you get better recommendations, a better user experience, increased retention, and so on.

Item based recommendation

My main reference for creating a simple item-based recommendation comes from (6). Here I am taking the complete cases of my entire ratings dataset. Since recommendation systems are so computationally heavy, I dropped every observation containing a missing value to decrease the load. This is definitely a limitation on the accuracy of the recommendation.

or_ratings <- or_ratings[complete.cases(or_ratings),]
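
One thing worth keeping in mind is that `complete.cases()` checks every column, including ones the recommender never uses. The toy data frame below is made up, and the idea that a mostly empty column such as `safety` drives the row loss in the real data is my assumption. It is only a sketch of how the choice of columns changes what survives.

```r
toy <- data.frame(
  route_id = c("a", "a", "b", "c"),
  ratings  = c(4, 3, 2, 4),
  safety   = c(NA, NA, "PG", NA),   # mostly missing, like an optional field
  stringsAsFactors = FALSE
)

# Filtering on every column drops any row with a missing value anywhere
n_all <- nrow(toy[complete.cases(toy), ])

# Filtering only on the columns the recommender needs keeps every rating
n_needed <- nrow(toy[complete.cases(toy[, c("route_id", "ratings")]), ])

c(all_columns = n_all, needed_columns = n_needed)
```

Restricting the check to users, route IDs, and ratings would trade a larger matrix for more routes to recommend from.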

Find a route we care about

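We are going to choose an Oregon route to get a recommendation for. We will choose the “most popular route” by the sum of votes. Note that num_votes is a per-route total repeated on every rating row, so the sum below scales it by the number of rating rows for that route; only the ranking matters here.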

or_ratings %>%
  group_by(route_id) %>%
  summarise(sum = sum(num_votes)) %>%
  arrange(desc(sum)) 
## # A tibble: 191 × 2
##    route_id    sum
##    <chr>     <dbl>
##  1 105790438 48841
##  2 105826816  8649
##  3 105812662  3969
##  4 105793286  3249
##  5 105930247  2304
##  6 106266547  2209
##  7 111750995  1089
##  8 106311211   900
##  9 105828698   784
## 10 107118805   484
## # … with 181 more rows

Overview

Peanut Brittle: a 5.8+ sport route at The Peanut in Smith Rock, Oregon, in the “good” class.

IBCF answers the question “climbers who climbed Peanut Brittle also climbed…?”

or_ratings %>% 
  filter(route_id == "105790438")  %>%
  select(route_id, route_name, grade, type, parent_sector, class, level) %>%
  slice(1)
## # A tibble: 1 × 7
##   route_id  route_name     grade type  parent_sector  class level       
##   <chr>     <chr>          <fct> <chr> <chr>          <fct> <fct>       
## 1 105790438 Peanut Brittle 5.8+  sport (h) The Peanut good  intermediate

Create user-product matrix

First we spread out our users, route ID, and ratings across a pivot table that we convert to a simple matrix, or the user-product matrix, for calculating similarity scores.

or_wide <- or_ratings %>%
  select(users, route_id, ratings) %>%
  distinct() %>%
  pivot_wider(names_from = route_id, values_from = ratings)

row.names(or_wide) <- or_wide$users
## Warning: Setting row names on a tibble is deprecated.
or_wide$users <- NULL
or_mat <- (as.matrix(or_wide))
or_mat[1:3, 1:3]
##      106266547 111750995 106966352
## [1,]         4        NA        NA
## [2,]         4         4         4
## [3,]         4        NA        NA
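
The printed matrix shows `[1,]`-style row labels, which suggests the user names were not carried over as row names (setting row names on a tibble is deprecated, as the warning notes). One possible alternative, sketched here on hypothetical toy data, is `tibble::column_to_rownames()`, which returns a plain data frame with the chosen column moved into the row names.

```r
library(dplyr)
library(tidyr)
library(tibble)

# Hypothetical toy ratings standing in for or_ratings
toy_ratings <- tibble(
  users    = c("ann", "ann", "bob", "cat"),
  route_id = c("r1", "r2", "r1", "r3"),
  ratings  = c(4, 3, 2, 4)
)

toy_mat <- toy_ratings %>%
  distinct() %>%
  pivot_wider(names_from = route_id, values_from = ratings) %>%
  column_to_rownames(var = "users") %>%  # users become row names
  as.matrix()

toy_mat
```

Keeping user names on the rows does not change the similarity calculation, but it makes the matrix easier to inspect.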

Calculate degree of sparsity

99% of cells lack data… another limitation to this method. But we may be able to tackle this issue with cosine similarity.

sum(is.na(or_mat))/(ncol(or_mat) * nrow(or_mat))
## [1] 0.9905141
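
The same proportion can be written as `mean(is.na(...))`, and the number of ratings per route shows where the sparsity is concentrated. A small sketch on a made-up matrix (the values are assumptions for illustration):

```r
# Hypothetical 3 x 3 rating matrix with missing cells
toy_mat <- matrix(c( 4, NA, NA,
                     4,  4, NA,
                    NA, NA, NA),
                  nrow = 3, byrow = TRUE)

# Proportion of missing cells; same as sum(is.na(m)) / (ncol(m) * nrow(m))
sparsity <- mean(is.na(toy_mat))

# Number of observed ratings in each column (route)
ratings_per_route <- colSums(!is.na(toy_mat))

sparsity
ratings_per_route
```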

Use cosine similarity to measure distance

We use cosine similarity rather than Euclidean distance because cosine measures directional similarity instead of differences in magnitude. Here I am hoping to look past the raw counts and capture the pattern they describe. For example, a route rated 3.0 four times and 4.0 eight times has a 100% similarity score with a route rated 3.0 once and 4.0 twice. A similarity based on Euclidean distance, 1/(1 + distance), would be only 13%. I’m hoping this method can help with the data sparsity issue.

library(lsa)
## Warning: package 'lsa' was built under R version 4.1.2
## Loading required package: SnowballC
route_x <- c(4, 8)
route_y <- c(1, 2)

cosine(route_x, route_y) # cosine
##      [,1]
## [1,]    1
1/(1 + sqrt((1-4)^2 + (2-8)^2)) # euclidean
## [1] 0.1297319
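
For reference, the two numbers above can be worked out by hand. Writing the two routes as x = (4, 8) and y = (1, 2), and using the standard definition of cosine similarity together with the 1/(1 + d) conversion used in the code:

```latex
\[
\cos(x, y) = \frac{x \cdot y}{\lVert x \rVert \, \lVert y \rVert}
           = \frac{4 \cdot 1 + 8 \cdot 2}{\sqrt{4^2 + 8^2}\,\sqrt{1^2 + 2^2}}
           = \frac{20}{\sqrt{80}\,\sqrt{5}}
           = \frac{20}{20} = 1
\]
\[
d(x, y) = \sqrt{(1 - 4)^2 + (2 - 8)^2} = \sqrt{45} \approx 6.708,
\qquad
\frac{1}{1 + d(x, y)} \approx \frac{1}{7.708} \approx 0.1297
\]
```

Because y is exactly x scaled by 1/4, the vectors point in the same direction and the cosine is 1, even though they are far apart in Euclidean terms.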

Compute across a matrix

We can then use a function to compute the similarity for various routes in our matrix.

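# Cosine similarity between two rating vectors.
# na.rm = TRUE drops missing ratings from each sum separately,
# which is equivalent to treating a missing rating as 0.
cos_similarity = function(A,B){
  num = sum(A * B, na.rm = TRUE)   # dot product over users who rated both routes
  den = sqrt(sum(A^2, na.rm = TRUE)) * sqrt(sum(B^2, na.rm = TRUE))   # product of vector norms
  result = num/den

  return(result)
}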
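
One property of this function is worth seeing on a small case: the numerator only counts users who rated both routes, while each norm in the denominator counts every rating a route received. The result is the same as filling missing ratings with 0, which may partly explain the low similarity scores on a matrix this sparse. The vectors and the helper `zero_fill` below are hypothetical and only illustrate the behaviour.

```r
# Copy of the function above so this sketch runs on its own
cos_similarity <- function(A, B) {
  num <- sum(A * B, na.rm = TRUE)
  den <- sqrt(sum(A^2, na.rm = TRUE)) * sqrt(sum(B^2, na.rm = TRUE))
  num / den
}

# Two hypothetical routes; only the first user rated both
route_a <- c(4, 4, NA, NA)
route_b <- c(4, NA, 4, NA)

sim_na <- cos_similarity(route_a, route_b)

# Hypothetical helper: replace NA with 0
zero_fill <- function(v) ifelse(is.na(v), 0, v)
sim_zero <- cos_similarity(zero_fill(route_a), zero_fill(route_b))

c(sim_na = sim_na, sim_zero = sim_zero)
```

Both values come out to 0.5, even though the single user who rated both routes gave them the same score.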

Apply this function to obtain Product-Product matrix

To prevent memory overload, we create a function to calculate the similarity only on the route we choose.

route_recommendation = function(route_id, rating_matrix = or_mat, n_recommendations = 5){

  route_index = which(colnames(rating_matrix) == route_id)

  similarity = apply(rating_matrix, 2, FUN = function(y) 
                      cos_similarity(rating_matrix[,route_index], y))

  recommendations = tibble(ID = names(similarity), 
                               similarity = similarity) %>%
    filter(ID!= route_id) %>% 
    top_n(n_recommendations, similarity) %>%
    arrange(desc(similarity)) 

  return(recommendations)

}
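
To check the logic independently of the Oregon data, the function can be run on a tiny matrix where the expected ordering is easy to verify by hand. This is a sketch, not the code used for the results in this post: the matrix `toy_mat` is made up, the guard for an unknown route is an addition, and `slice_max()` is swapped in for `top_n()`, which dplyr marks as superseded.

```r
library(dplyr)
library(tibble)

cos_similarity <- function(A, B) {
  num <- sum(A * B, na.rm = TRUE)
  den <- sqrt(sum(A^2, na.rm = TRUE)) * sqrt(sum(B^2, na.rm = TRUE))
  num / den
}

route_recommendation <- function(route_id, rating_matrix, n_recommendations = 5) {
  route_index <- which(colnames(rating_matrix) == route_id)
  if (length(route_index) == 0) stop("route_id not found in rating_matrix")

  # Similarity of the chosen route to every column (route) in the matrix
  similarity <- apply(rating_matrix, 2, function(y)
    cos_similarity(rating_matrix[, route_index], y))

  tibble(ID = names(similarity), similarity = similarity) %>%
    filter(ID != route_id) %>%                 # drop the route itself
    slice_max(similarity, n = n_recommendations, with_ties = FALSE)
}

# Hypothetical ratings: 4 users (rows) by 3 routes (columns)
toy_mat <- cbind(
  r1 = c(4, 4, NA, 3),
  r2 = c(4, NA, 2, 3),
  r3 = c(NA, 1, 4, NA)
)

recs <- route_recommendation("r1", toy_mat, n_recommendations = 2)
recs
```

Here `r2` shares two raters with `r1` and `r3` shares only one, so `r2` should rank first.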

Get recommendations for some route

Our function returns the top 5 routes most similar to Peanut Brittle.

my_route <- "105790438"
recommendations = route_recommendation(my_route)
recommendations
## # A tibble: 5 × 2
##   ID        similarity
##   <chr>          <dbl>
## 1 105812662     0.122 
## 2 106776023     0.109 
## 3 105826816     0.0940
## 4 106303364     0.0842
## 5 107118805     0.0807

Join with ratings data

Next, we can join back to our original data to get information about the recommended routes. To build on this method, I could use a machine learning framework in R such as caret or tidymodels to fit a K-Nearest Neighbors model on cosine similarity, with cross-validation and hyperparameter tuning to improve accuracy. My biggest concern is that this approach would overload the Shiny app.

or_sub <- or_ratings %>%
  mutate(route_id = as.character(route_id))

rec_tbl <- recommendations %>%
  mutate(ID = as.character(ID)) %>%
  left_join(or_sub, by = c("ID" = "route_id")) %>%
  select(ID, name, similarity, grade, type, state, sector_ID, parent_sector, lon, lat, ARQI_median, class, level) %>%
  distinct() 

rec_tbl
## # A tibble: 5 × 13
##   ID     name   similarity grade type  state sector_ID parent_sector   lon   lat
##   <chr>  <chr>       <dbl> <fct> <chr> <chr>     <dbl> <chr>         <dbl> <dbl>
## 1 10581… Cave …     0.122  5.4   trad  Oreg… 118897491 Southside     -121.  44.4
## 2 10677… West …     0.109  5.3   trad  Oreg… 118897491 Southside     -121.  44.4
## 3 10582… Sky R…     0.0940 5.8+  trad  Oreg… 105826813 (1) Northeas… -121.  44.4
## 4 10630… South…     0.0842 5.3   trad  Oreg… 118897491 Southside     -121.  44.4
## 5 10711… Mines…     0.0807 5.8+  trad  Oreg… 105865164 (3) Hand Job… -121.  44.4
## # … with 3 more variables: ARQI_median <dbl>, class <fct>, level <fct>

Conclusion

     Another limitation of my project is the restriction placed on the data I’m using. Following a legal battle with onX over a copyright infringement claim, which OpenBeta won (as noncommercial and educational factual data cannot be copyrighted), OpenBeta is working to release its data under a permissive public-domain license (1,3). For this reason, all of OpenBeta’s current datasets include user ratings only up to 2020, and it seems unlikely that they will be updated by August. The recommendations therefore miss input from the wave of new climbers that followed Tokyo. In the future, climbing data from the Mountain Project will be streamed directly into the model prior to rollouts, which should allow more accurate recommendations.
      In summary, I plan to provide a recommendation engine and route quality mapping system that streamlines the route-searching process for climbers of varying skill levels. I truly believe that any stakeholder (athletes, sponsors, spectators, media, businesses, brands, participants, etc.) could benefit from this project. Guiding the recent explosion of climbers properly could help make the sport highly profitable and the climbing community larger and more diverse. I hope to help dissolve barriers to entry by giving climbers a tool that keeps them satisfied, safe, and yearning for more. For the data community, I also want to promote the open source movement in software and data, as I strongly believe it is essential for encouraging innovation, attracting diverse talent, and broadening perspectives within tech.

  1. https://OpenBeta.io/
  2. https://www.forbes.com/sites/michellebruton/2021/11/24/interest-in-climbing-and-gym-memberships-have-spiked-following-sports-tokyo-olympics-debut/?sh=3daaf24326a8
  3. https://www.climbing.com/news/mountain-project-OpenBeta-and-the-fight-over-climbing-data-access/
  4. https://www.thecrag.com/
  5. https://www.mountainproject.com/
  6. https://anderfernandez.com/en/blog/how-to-code-a-recommendation-system-in-r/
  7. https://www.climbing.com/videos/pioneering-smith-rock-alan-watts-and-the-birth-of-us-sport-climbing/